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[57] ABSTRACT 

A method and a device for discriminating a voiced sound 
from an unvoiced sound or background noise in speech 
signals are disclosed. Each block or frame of input speech 
signals is divided into plural sub-blocks and the standard 
deviation, effective value or the peak value is detected in a 
detection unit for detecting statistical characteristics from 
one sub-block to another. A bias detection unit detects a bias 
on the time scale of the standard deviation, effective value 
or the peak value to decide whether the speech signals are 
voiced or unvoiced from one block to another. 

17 Claims, 13 Drawing Sheets 
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METHOD AND DEVICE FOR 
DISCRIMINATING VOICED AND UNVOICED 
SOUNDS 

This application is a division of application Ser. No. 5 
08/048,034, filed Apr. 14, 1993, pending, which is hereby 
incorporated by reference. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to a method and a device for 
making discrimination between the voiced sound and the 
noise or the unvoiced sound in speech signals. 

2. Statement of Related Art 15 
The speech or voice is classified into the voiced sound and 

the unvoiced sound. The voiced sound is the voice accom- 
panied by vibrations of the vocal cord and consists in 
periodic vibrations. The unvoiced sound is the voice not 
accompanied by vibrations of the vocal cord and consists in ^ 
non-periodic vibrations. The usual speech is composed 
mainly of the voiced sound, with the unvoiced sound being 
a special consonant termed unvoiced consonant. The period 
of the voiced sound is determined by the period of the 
vibrations of the vocal cord and is termed the pitch period, 25 
a reciprocal of which is termed a pitch frequency. In the 
following description, the term pitch means a pitch period. 
The pitch period and the pitch frequency are cmcial factors 
on which depend highness or lowness of the speech or the 
intonation. Thus the sound quality of the speech depends on 
how precisely the pitch is grasped. However, in grasping the 
pitch, it is necessary to take account of the noise around the 
speech, or so-called background noise as well as quantiza- 
tion noise produced on quantization of analog signals into 
digital signals. In encoding speech signals, it is crucial to 35 
make distinction between the voiced sound from these 
noises and the unvoiced sound. 

Among analog speech analysis systems, hitherto known 
in the art, there are such systems as disclosed in U.S. Pat. 
Nos. 4,637,046 and 4,625327. In the former, input analog ^ 
speech signals are divided into segments in the chronologi- 
cal sequence,and signals contained in these segments are 
rectified to find a maximum value which is compared to a 
threshold value to make a voice/unvoiced decision. In the 
latter, analog speech signals are converted into digital sig- 
nals and divided into segment and'discfete Fourier transform 
is carried out from segment to segment to find an absolute 
value for each spectrum which is then compared to a 
threshold value to make a voiced/unvoiced decision. 

Specific examples of encoding of speech signals include 
multi-band excitation coding (MBE), single band excitation 
coding (SBE), harmonic coding, sub-band coding (SBC), 
linear predictive coding (LPC), discrete cosine transform 
(DCT), modified DCT (MDCT) and fast Fourier transform „ 
(FFT). 

For extracting the pitch from the input speech signal 
waveform by MBE coding, for example, pitch extraction 
may be achieved easily even if the pitch is not represented 
manifestly. For decoding at the synthesis side, a voiced go 
sound waveform on the time domain is synthesized based on 
the pitch so as to be added to a separately synthesized 
unvoiced sound waveform on the lime domain. 

Meanwhile, if the pitch is adapted to be extracted easily, 
it may occur that a pitch that is not a true pitch be extracted 65 
in background noise segments. If such pitch other than the 
true pitch be extracted by MBE encoding, cosine waveform 



2 

synthesis is performed so that peak points of the cosine 
waves arc overlapped with one another at a pitch which is 
not the true pitch. That is, the cosine waves are synthesized 
by addition at a fixed phase (0-phase or 7z/2 phase) in such 
a manner that the voiced sound is synthesized at a pitch 
period which is not the true pitch period, such that the 
background noise devoid of the pitch is synthesized as a 
periodic impulse wave. In other words, amplitude intensities 
of the background noise, which intrinsically should be 
scattered on the time axis, are concentrated in a frame 
portion, with certain periodicity to produce an extremely 
obtrusive extraneous sound. 

SUMMARY OF THE INVENTION 

In view of the above-depicted status of the art, it is an 
object of the present invention to provide a method for 
making discrimination between voiced and unvoiced sounds 
whereby the voiced sound may positively be distinguished 
from the noise or unvoiced sound for preventing obtrusive 
extraneous sound from being produced during speech syn- 
thesis. 

In one aspect, the present invention provides a method for 
discriminating a voiced sound from unvoiced sound or noise 
in input speech signals by dividing the input speech signals 
into blocks and giving a decision for each of these blocks as 
to whether or not the speech signals are voiced comprising 
the steps of subdividing one-block signals into a plurality of 
sub -blocks, finding statistical characteristics of the signals 
fi-om one sub-block to another, and deciding whether or not 
the speech signals arc voiced depending on a bias of the 
statistical characteristics on the time scale. 

The peak value, effective value or the standard deviation 
of the signals for each of the sub-blocks may be employed 
as the aforementioned statistical characteristics. 

In another aspect, the present invention provides a method 
for discriminating a voiced sound from an unvoiced sound 
or noise in input speech signals by dividing the input speech 
signals into blocks and giving a decision for each of these 
blocks as to whether or not the speech signals are voiced 
comprising the steps of finding the energy distribution of 
one-block signals on the frequency scale, finding the signal 
level of said one-block signals, and deciding whether or not 
the speech signals are voiced depending on the energy 
distribution and the signal level of one -block signals on the 
fi-equency scale. . „ _ 

Such voiced/unvoiced decision may also be made depend- 
ing on the statistical characteristics of sub-block signals, 
namely the effective value, the standard deviation or the 
peak value and energy distribution of one block signals on 
the frequency scale, or alternatively, on the statistical char- 
acteristics of the sub-block signals, namely the effective 
value, the standard deviation or the peak value and the signal 
level of one-block signals. 

In still another aspect, the present invention provides a 
method for discriminating a voiced sound from unvoiced 
sound or noise in input speech signals by dividing the input 
speech signals into blocks and giving a decision for each of 
these blocks as to whether or not the speech signals are 
voiced comprising the steps of subdividing one-block sig- 
nals into a plurality of sub-blocks, finding statistical char- 
acteristics of the signals, that is effective value, standard 
deviation or peak value, from one sub-block to another, 
finding the energy distribution of the one-block signals on 
the frequency scale, finding the signal level of the one-block 
signals on the frequency scale, and deciding whether or not 
the speech signals are voiced depending on the effective 
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value, standard deviation or the peak value, the energy 
distribution of the one -block signals on the frequency scale, 
and the signal level of the one-block signals on the fre- 
quency scale. 

In yet another aspect, the present invention provides a 
method for discriminating a voiced sound from unvoiced 
sound or noise in input speech signals by dividing the input 
speech signals into blocks and giving a decision for each of 
these blocks as to whether or not the speech signals are 
voiced comprising the steps of subdividing one-block sig- 
nals into a plurality of sub-blocks, finding an effective value 
on the time scale for each of the sub-blocks and finding the 
distribution of the effective values for each of the sub-blocks 
based on the standard deviation and mean value of these 
effective values, finding energy distribution of said one- 
block signals on the frequency scale, finding the level of said 
one-block signals and deciding whether or not the speech 
signals are voiced depending on at least two of the distri- 
bution of the effective value from sub-block to sub-block, 
energy distribution of the one-block signals on the frequency 20 
scale and the level of the one-block signals. 

The decision as to whether or not the speech signals are 
voiced means discriminating the voiced sound from the 
unvoiced sound or noise in the speech signals. 

Hie voiced sound in the speech signals may be discrimi- 
nated from the unvoiced signal or the noise by relying when 
the difference in the bias in the statistical characteristics on 
the time scale between the voiced signals and the unvoiced 
signals or the noise. 

30 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGS, la to Ic are functional block diagrams showing a 
schematic arrangement of a voiced sound discriminating 
device for illustrating a first embodiment of the voiced 
sound discriminating device according to the present inven- 35 
tion. 

FIGS. 2a to 2d are waveform diagrams for illustrating 
statistical characteristics of signals. 

FIGS. 3a and 3^ are functional block diagrams for illus- 
trating an arrangement of essential portions of a voiced/ ^ 
unvoiced discriminating device for illustrating the first 
embodiment, 

FIG. 4 is a functional block diagram showing a schematic 
arrangement of a voiced sound discriminating device for 
illustrating a second embodiment of the voiced sound dis- 
criminating device according to the present invention. " 

FIG. 5 is a functional block diagram showing a schematic 
arrangement of a voiced sound discriminating device for 
illustrating a third embodiment of the voiced sound discrimi- 
nating device according to the present invention. 

FIG. 6 is a functional block diagram showing a schematic 
arrangement of a voiced sound discriminating device for 
illustrating a fourth embodiment of the voiced sound dis- 
criminating device according to the present invention. 

FIGS, la and 7b are waveform diagrams for illustrating 
distribution of short-time rms values as statistic character- 
istics of signals. 

FIG. 8 is a functional block diagram showing a schematic 
arrangement of an analysis side (encoder side) of a speech 
signal synthesis/analysis system as a concrete example of a 
device to which the voiced sound discriminating method 
according to the present invention is applied. 

FIGS. 9a and 9b are graphs for illustrating a windowing 
operation. 

FIG. 10 is a graph for illustrating the relation between the 
windowing operation and a window function. 
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FIG. 11 is a graph showing time-domain data to be 
orthogonally transformed, herein FFT. 

FIG. 12a is a graph showing the intensity of spectral data 
on the frequency domain. 

FIG. 12b is a graph showing the intensity of a spectral 
envelope on the frequency domain, 

FIG, 12c is a graph showing the intensity of a power 
spectrum of excitation signals on the frequency domain. 

FIG, 13 is a functional block diagram showing a sche- 
matic arrangement of a synthesis side (decoder side) of a 
speech signal analysis/synthesis system as a concrete 
example, of a device to which the voiced sound discrimi- 
nating method according to the present invention may be 
applied. 

FIGS. 14a to 14c are graphs for illustrating synthesis of 
unvoiced sound during synthesis of speech signals. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

Referring to the drawings, preferred embodiments of the 
method for making discrimination between voiced and 
unvoiced sounds according to the present invention will be 
explained in detail. 

FIGS, la to Ic show a schematic arrangement of a device 
for making discrimination between voiced and unvoiced 
sounds for illustrating the voiced soimd discriminating 
method according to a first embodiment of the present 
invention. The present first embodiment is a device for 
making discrimination of whether or not the speech signal is 
voiced a sound depending on the bias on the time domain of 
statistical characteristics of speech signals for each of sub- 
blocks of speech signals divided from a block of speech 
signals. 

Referring to FIGS, la and lb, digital speech signals, freed 
of at least low-range signals (with frequencies not higher 
than 200 Hz) for elimination of a dc offset or bandwidth 
limitation to e.g. 200 to 3400 Hz by a high-pass filter (HPF), 
not shown; are supplied to an input terminal 11. These 
signals are transmitted to a windowing or window analysis 
unit 12. In the analysis unit 12, each block of the input digital 
signals consisting of N samples, N being 256, is windowed 
with a rectangular window, so that the input signals are 
sequentially time-shifted an interval of a frame consisting of 
L samples, where L equals 160. An overlap between adjacent 
blocks is (N-L) samples or 96 samples. This technique is 
disclosed in e.g. IEEE Transaction on Acoustics Speech and 
Signal Processing, vol.ASSP-28, No. 1, February 1980, pp. 
90 to 101. Signals of each block, consisting of N samples, 
from the window analysis unit 12, are supplied to a sub - 
block division unit 13, The sub-block division unit 13 
sub -divides the signals of each block from the window 
analysis unit 12 into sub-blocks. The resulting sub-block 
signals are supplied to a detection unit for detecting statis- 
tical characteristics. In the present first embodiment, the 
detection unit is a standard deviation data detection unit 15 
shown in FIG. la, an effective value data detection unit 15' 
shown in FIG. lb or a peak value detection unit 16 in FIG. 
Ic, The standard deviation data from the standard deviation 
data detection unit 15 are supplied to a standani deviation 
bias detection unit 17. The effective value data from the 
effective value data detection unit 15' are supplied to an 
effective value bias detection unit 17'. The detection units 
17, 17' detect the bias of the standard deviation and the 
effective values of each sub-block from the standard value 
data and from the effective value data, respectively. The 
time-base data concerning the bias of the standard deviation 
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01 effective values are supplied to a decision unit 18. The 
decision unit 18 compares the lime-base data concerning the 
bias of the standard deviation values or the effective values 
to a predetermined threshold for deciding whether or not the 
signals of each sub-block are voiced and outputs resulting 
decision data at an output terminal 20. Referring to FIG. Ic, 
peak value data from peak value data detection unit 16 are 
supplied to a peak value bias detection unit 19. The unit 19 
detects the bias of peak values of the time domain signals 
frona the peak value data. The resulting data concerning the 
bias of peak values of the time domain signals are supplied 
to decision unit 18. The unit 18 compares the time -base data 
concerning the bias of the peak values of the signals on the 
time domain to a predetermined threshold for deciding 
whether or not the signals of each sub-block are voiced and 
outputs resulting decision data at an output terminal 20. The 
detection of the effective values, standard deviation values 
and the peak values of the sub-block signals, employed in 
the present embodiment as statistical characteristics, as well 
as the detection of the bias of these values on the time 
domain, is hereinafter explained. 

The reason the standard deviation, effective values or the 
peak values of the sub-block signals are found in the present 
first embodiment is that the standard deviation, effective 
values or the peak values differ significantly on the time 
domain between the voiced sound and the noise or the 
unvoiced sound. For example, the vowel (voiced sound) of 
speech signals shown in FIG. 2a is compared to the noise or 
the consonant (imvoiced sound) thereof shown in FIG. 2c. 
The peak amplitude values of the vowel sound are arrayed 
in an orderly fashion, while exhibiting a bias on the time 
domain, as shown in FIG. 2b, whereas those of the conso- 
nant sound or unvoiced soiind are arrayed in a disorderly 
fashion, although they exhibit certain flatness or uniformity 
on the time domain, as shovra in FIG. 2d. 

The detection imits 15, IS', shown in FIGS, la and lb, for 
detecting the standard value data and the effective value 
data, respectively, from one sub -block to another, and detec- 
tion of the bias of the standard deviation data or the effective 
value data on the time domain, are hereinafter explained. 

The detection unit 15 for detecting standard deviation 
values, shown in FIG, 3a, is made up of a standard deviation 
calculating unit 22 for calculating the standard deviation of 
the input sub -block signals, an arithmetical mean calculating 
unit 23 for calculating an arithmetical mean of the standard 
deviation values, and a'geometrical meancalculating unit 24 
for calculating a geometrical mean of the standard deviation 
values. Similarly, the detection unit 15' for detecting effec- 
tive values, shown in FIG, 36, is made up of an effective 
value calculating unit 22' for calculating the effective values 
for input sub-block signals, an arithmetical mean calculating 
unit 23' for calculating an arithmetical mean of the effective 
values, and a geometrical mean calculating unit 24 for 
calculating a geometrical mean of the effective values. The 
detection units 17, 17' detect bias data on the time domain 
from the arithmetical and the geometrical mean values, 
while the decision unit 18 decides, from the bias data, 
whether or not the sub-block speech signals are voiced, and 
the resulting decision data is outputted at output terminal 20. 

By referring to FIGS, la and lb and FIGS. 3a and 3h, the 
principle of deciding whether or not the speech signals are 
voiced sound based on the above-mentioned energy distri- 
bution is explained. 
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samples is indicated as x(n). The 256-sample block is 
divided by the sample block division unit 13 at an interval 
of 8 samples. Thus an N/B^ («=256/8=32) number of sub- 
blocks, each having a sub -block length are present in 
one block. These 32 sub-block time-domain data are sup- 
plied to e.g. the standard deviation calculating unit 22 of the 
standard deviation data detection unit 15 or of the effective 
value detection unit 15' of the effective data calculating unit 
15'. 

The calculating units 22, 22* output standard deviation 
value a^(i) of the time-domain data, as found by the formula 
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(1) 



at 0^i<N/Bi 
where k=ixBi 
at 0^i<N/Bi 

from one sub-block to another. In the above formula, i is an 
20 index for a sub-block and k is a number of samples, while 
X is a mean value of the input samples for each block. It 
should be noted that the mean value x is not a mean value 
for each sub-block but is a mean value for each block, that 
is a mean value of the N number of samples of each block, 
25 Also it should be noted that the effective value for each 
sub -block is also given by the formula (1) in which (x(n))^, 
that is a root-mean-square (rms) value, is substituted for the 
term (x(n)-X)^. 
The standard deviation a„(i) is supplied to arithmetical 
30 mean calculating unit 23 and to geometrical mean calculat- 
ing unit 24 for checking into signal distribution on the time 
axis. The calculating units 23,24 calculate the arithmetical 
mean a^^^^ and the geometrical mean a^.^^ in accordance 
with formulas (2) and (3): 



35 



2 or„{0 



'~mr iZo 



{N/Bi-l ] ^ 

^0 I 



(2) 



(3) 



It is noted that, while the formulas (1) to (3) are concerned 
only with the standard deviation, similar calculation may be 
made for the effective values as well. 

The arithmetical mean a^.^^^ and the geometrical mean 
' ^mpv* calculated in 'accordance wilh the formulas (1) to' 
(3), are supplied to the standard deviation bias detection unit 
17 or to the effective value bias detection unit 17'. The 
standard deviation bias detection unit 17 or the effective 
value bias detection unit 17' calculate a ratio from the 
arithmetical mean a^^^^ and the geometrical mean a^.^^^^ 
with formula (4). 

Pr^viMdJ^^jnpy (4) 

The ratio py> which is a bias data representing the bias of the 
standard deviation data on the time scale, is suppUed to 
decision unit 18. The decision unit 18 compares the bias data 
(ratio p^) to a predetermined threshold p^/^to decide whether 
or not the sound is voiced. For example, if the threshold 
value prt/is set to 1.1, and the bias data p^is found to be 
larger than it, a decision is given that a deviation from the 
standard deviation or the effective value is larger and hence 
the signal is a voiced sound. Conversely, if the distribution 



data p^is smaller than the threshold value ^^^p a decision is 
The number of samples N of a block as segmented by 65 given that deviation from the standard deviation or the 
windowing with a rectangular window by the window effective value is smaller, that is the signal is flat, and hence 
analysis unit 12 is assumed to be 256, and a train of input the signal is unvoiced, that is noise or unvoiced sound. 
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Referring to FIG. Ic, the peak value data detection unit 16 the peak value bias data P„ is smaller than the threshold 

for detecting peak value data and detection of bias of the value P^;^, a decision is given that the bias of the peak values 

peak values on the time scale, are hereinafter explained. The on the time axis is larger and hence the signal is a voiced 

peak value detection unit 16 is made up of a peak value sound. On the other hand, if the peak value bias data P„ is 

detection uoit 26 for detecting a peak value from sub-block 5 larger than the threshold value P,;^, a decision is given that 

signals from one sub-block to another, a mean peak value deviation of the bias of the peak values on the time scale is 

calculating unit 27 for calculating a mean value of the peak smaller and hence the signal is a noise or an unvoiced sound, 

values from the peak value detection unit 26, and a standard ^ith the above-described first embodiment of the voiced 

deviation calculating unit 28 for calculating a standard sound discrimiDation method according to the present 

deviationfrom the block-by-blocksignalssuppHed from the lo ^n^^^^ion, the decision as to whether the sound signal is 

.J 1 . vi'iT^ 1 1 u- J * *• u voiced IS given on the basis of the bias on the time scale of 

^ndow analysis unit 12. The pc^ value bias detecting unit ^^^^ J^^^^ characteristics, such as peak values, effective 

19 divides the mean peak value from the mean peak value ^^^^^^ ^^^^^^^^ deviation, of the sub-block signals. 

calculaUng unit 27 by the block-by-block standard deviation ^ ^^^^ ^^^^ discriminating device for illustrating the 

value from the standard deviation calculaUng umt 28 to find ^^iced sound discriminating method according to the second 

bias of the mean peak values on the time axis. The mean 15 embodiment of the present invention is shown schematically 

peak value bias data is supphed to decision unit 18. The in HG. 4. With the present second embodiment, a decision 

decision unit 18 decides, based on the mean peak value bias as to whether or not the sound signal is voiced is made on 

data, whether or not the sub-block speech signal is voiced, the basis of the signal level and energy distribution on the 

and outputs a corresponding decision signal at output ter- frequency scale of the block speech signals, 

minal 20. 20 With the present second embodiment, the tendency for the 

Hie principle of deciding from the peak value data energy distribution of the voiced sound to be concentrated 

whether or not the signal is voiced is explained by referring towards the low frequency side on the frequency scale and 

to FIG. Ic. for the energies of the noise or the unvoiced sound to be 

An N/Bi number of sub-block signals, that is 256/8=32 concentrated towards the high frequency side on the fre- 

sub -block signals, having a sub-block length Bjr=8, for 25 quency scale, is utilized. 

example, are supplied to the peak value detection unit 26 via Referring to FIG. 4, digital speech signals, freed of at least 

window analysis unit 12 and sub-block division unit 13. The low-range signals (with frequencies not higher than 200 Hz) 

peak value detection unit 26 detects a peak value P(i) for for elimination of a dc offset or bandwidth limitation lo e.g. 

each of the 32 sub-blocks in accordance with the formula (5) 200 to 3400 Hz by a high-pas filter (HPF), not shown, arc 

30 supplied to an input terminal 31. These signals are trans- 
mitted to a window analysis unit 32. In the analysis unit 32, 

k%nSk+Bi-i each block of the input digital signals consisting of N 

samples, N being 256, are windowed with a hamming 
at 0<i<N/Bi window, so that the input signals are sequentially time- 
where k=ixBj 35 shifted at an interval of a frame consisting of L samples, 
In formula (5), i is an index for sub-blocks and k is the where L equals 160. An overlap between adjacent blocks is 
number of samples while MAX is a function for finding a (N-L) samples or 96 samples. The resulting N-sample block 
maximum values. signals, produced by the window analysis unit 32, are 
The mean peak value calculating unit 27 calculates a transmitted to an orthogonal transform unit 33. The orthogo- 
mean peak value P from the above peak value P(i) in 40 nal transform unit 33 orthogonaUy transforms a sample 
accordance with the formula (6). string, consisting of 256 samples per block, such as by fast 

Fourier transform (FFT), for converting the sample string 

p _ 1 ^/^i-i data into a data string on the frequency scale. The frequency- 

-"WFr~ ,=0 domain data from the orthogonal transform unit 33 are 

A AA ' r 1 w J ui 1 supplied to an energy detectionumt34. The energy detection 

The standard deviation calculatmg umt 28 finds the block- i/i j *t, f j • j i- j * 

u * A .A A • * - -• -1 - *u - umt 34 divides the frequency domam data supphed thereto 

by-block standard deviation in accordance with the ■ ,^„, a , a u- u e a * 4U 

formula (1) low-frequency data and high-frequency data, the ener- 

^ ^ gies of which are detected by a low-frequency energy 

detection unit 34a and a high-frequency energy detection 

J Af-i 50 unit 346, respectively. The low-range energy values and 

^^''\ W „S)^^"^~^^ ' *^^Sh- range energy values, as detected by low-frequency 

energy detection unit 24a and high-frequency energy detec- 

The peak value bias detection unit 19 calculates the peak tion unit Mb, respectively, are supplied to an energy distri- 

value bias data P„ from the mean peak value P and the bution calculating unit 35, where the ratio of the two 

standard deviation a^, in accordance with the formula (8) S5 detected energy values is calculated as energy distribution 

P data. The energy distribution data, as . found by the energy 

distribution calculating uinit 35, is supplied to a decision unit 

It is noted that an effective value calculating unit for 37. The detected values of the low-range and high-range 

calculating an effective value (rms value) may also be energies are supplied to a signal level calculating imit 36 

employed in place of the standard deviation calculating unit 60 where the signal level per sample is found. The signal level 

28. data, as calculated by the signal level calculating unit 36, is 

The peak value bias data P„, as calculated in accordance supphed to decision unit 37. The unit 37 decides, based on 

with formula (8), is a measure for bias(localized presence) the energy distribution data and the signal level data, 

of the peak values on the time scale, and is transmitted to whether the input speech signal is voiced, and outputs a 

decision unit 18. The decision unit 18 compares the peak 65 corresponding decision data at an output terminal 38. 

value bias data P„ to the threshold value P^^„ to decide The operation of the above-described second embodiment 

whether or not the signal is a voiced sound. For example, if is hereinafter explained. 
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Hie number of samples N of a block as segmented by 
windowing with a hamming window by the window analysis 
unit 12 is assumed to be 256, and a train of input samples is 
indicated x(a). The time -domain data, consisting of 256 
samples per block, are converted by the orthogonal trans- 
form unit 33 into one -block frequency-domain data. These 
one-block frequency-domain data are supplied to the energy 
detection unit 34 where an amplitude a„(j) is found in 
accordance with the formula (9) 



(9) 



where R^(j) and I(j) indicate a real number part and an 
imaginary number part, respectively, and i indicates a num- 
ber of samples of not less than 0 and less than N/2 (=128 
samples). 

The low-eoergy detection unit 34a and 34^? high energy 
detection unit of the energy detection unit 34 find the 
low-range energy and the high-range energy S^, 
respectively, from the amplitude a„(j) in accordance with the 
formulas (10) and (11) 



15 



20 



Sl 



Sh 



N/4-1 
- X a 

NP.-1 
j~Nf4 



(10) 



(11) 25 



The low range is herein a frequency range of e.g. 0 to 2 kHz, 
while the high range is a frequency range of 2 to 3.4 kHz. 
The low-range energies and the high-range energies 3^=^, 
as calculated by the formulas (10), (11), respectively, are 
supplied to distribution calculating unit 35 where energy 
distribution balance data, that is energy distribution data on 
the frequency axis f^, is found based on the ratio S^/S^^. That 
is, 

A-^JSm (12) 

The energy distribution data f^ on the frequency scale is 
supplied to decision unit 37 where the energy distribution 
data f^ is compared to a predetermined value f^^^^ to make 
decision as to whether or not the speech signal is voiced. If, 
for example, the threshold f^^,^ is set to 15, and the energy 
distribution data f^ is smaller than f^, a decision is given 
that the speech signal is likely to be a noise or unvoiced 
sound, instead of a voiced sound, because of concentrated 
energy distribution in the high frequency side. 
-On the other hand,- the low-range energies- S^- and the 
high-range energies are also supplied to signal level 
calculation unit 36 where data on a signal mean level 1" is 
found in accordance with the formula 



30 



35 



40 



45 



50 



(13) 



using the low-range energies and the high-range energies 
Sjf. The mean level data 1" is also supplied to decision unit 55 
37. The decision unit 37 compares the mean level data l^ to 
a predetermined threshold 1,;,^ to decide whether or not the 
speech sound is voiced. If, for example, the threshold value 
Ij;^ is set to 550, and the mean level data 1" is smaller than 
the threshold value 1^^,^, a decision is given that the signal is 60 
not likely to be voiced sound, that is, it is likely to be a noise 
or unvoiced sound. 

It is possible with the decision unit 37 to give the 
voiced/unvoiced decision based on one of the energy dis- 
tribution data ff, or the mean level data l^, as described 65 
above. However, if both of these data are used, the decision 
given has improved reliability. That is, with 



h<hb and 1„<1,;,„ 



the speech is decided to be voiced with higher reliability. 
The decision data is issued at output terminal 38. 

Besides, the energy distribution data f^, and the mean level 
data Ifl according to the present second embodiment may be 
separately combined with the ratio p^ which is the bias data 
of the standard deviation values or effective values on the 
time scale according to the first embodiment to give a 
decision as to whether or not the speech signal is voiced. 
That is, if 

Pf^Prhf and fft<f^, or and l^<f^, 

the signal is decided to be not voiced with higher reliability. 

In this manner it is possible with the present second 
embodiment to decide whether or not the speech signal is 
voiced by relying upon the tendency for the energy distri- 
bution of the voiced sound and that of the unvoiced sound 
or noise to be concentrated towards the lower and higher 
frequency range respectively. 

FIG. 5 schematically shows a voiced/unvoiced discrimi- 
nating unit for illustrating a voiced sound discriminating 
method according to a third embodiment of the present 
invention. 

Referring to FIG. 5, speech signals supplied to input 
terminal 11 via window analysis unit 12 and sub -block 
division unit 13 arc freed at least of low-range components 
of less than 200 Hz, windowed by a rectangular window 
with N samples per block, N being e.g. 256, time-shifted and 
divided into sub-blocks, are supplied to a detection unit for 
detecting statistical characteristics. Statistic characteristics 
are detected of the, sub -block signals by the detection unit 
for detecting the statistic characteristics. In the present 
embodiment, the standard deviation data detecting unit 15, 
the effective value data detecting unit 15' or the peak value 
data detection unit 16 is used as such detection unit. The 
standard deviation or effective value bias detection unit 17 
or the peak value bias detection unit 19, explained in the 
preceding first embodiment, detect the localization of the 
statistic characteristics on the time scale based on the 
above-mentioned statistical characteristics. The bias data 
from the localization detection unit 17 or 19 is supplied to 
decision unit 39. The energy detection unit 34 is supplied 
with data freed at least of low-range components of not more 
than 200 Hz by a window analysis unit 42 and an orthogonal 
transfonn unit 33, windowed by a hamming window with N 
samples per block, N being e.g. 256, time-shifted and 
.oithogonaltransformed into.data.on.the frequency scale. The 
frequency-domain data are supplied to energy detection unit 

34. The detected high-range side energy values and the 
detected low-range side energy values are supplied to an 
energy distribution calculation unit 35. The energy distribu- 
tion data, as found by the energy distribution calculation unit 

35, is supplied to a decision unit 39. The detected high-range 
side energy values and the detected low-range side energy 
values are also supplied to a signal level calculating unit 35 
where a signal level per sample is calculated. The signal 
level data, calculated by the signal level calculating unit 36, 
is supplied to decision unit 39, which is also supplied with 
the above-mentioned bias data, energy distribution data and 
the signal level data. Based on these data, the decision unit 
39 decides whether or not the input speech signal is voiced. 
The corresponding decision data is outputted at output 
terminal 43. 

The operation of the present third embodiment is herein- 
after explained. 

With the present third embodiment, the decision unit 39 
gives a voiced/unvoiced decision, using the bias data p^ of 
the sub-frame signals from bias detection units 17, IT or 19, 
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energy distribution data 4 from the distribution calculating 
unit 35 and the mean level data 1" from the signal level 
calculating unit 36. For example, if 

p^p^ and ft<f,;^ and 1"<1^, 
the input speech signal is decided to be not voiced with 
higher reliability. 

In the present third embodiment, a decision as to whether 
or not the input speech signal is voiced is given responsive 
to the bias data of the statistical characteristics on the time 
scale, energy distribution data and mean value data. 

If, in the voiced sound discriminating method according 
to the above-described embodiments, a voiced/unvoiced 
decision is to be given using the bias data p^ of sub-frame 
signals, temporal changes of the data p^-are pursued and the 
sub-block signals are decided to be flat only if 

for five frames on end, so that a flag is set. If 

for one or more of the five frames, the flag P^^ is set to 0. If 

ft<fi, and Pj.,ol and r<U,, 
the input speech signal may be decided to be not voiced with 
extremely high reliability. 

If a decision is given that the signal is not voiced, that is, 
it is the background noise or the consonant, the entire block 
of the input speech signal is compulsorily set to be unvoiced 
sound to eliminate generation of an extraneous sound during 
voice synthesis using a vocoder such as MBE. 

Referring to FIGS. 6, la and 7b, a fourth embodiment of 
the voiced sound discriminating method according to the 
present invention is explained. 

In the above -described first embodiment, the ratio of the 
arithmetical mean to the geometrical mean of standard 
deviation data and cfifective value data is found to check for 
the distribution of standard deviation values and effective 
values (rms values) of the sub-block signals. For finding the 
geometrical mean value, it is necessary to carry out a 
number of times of data multiplication equal to the number 
of sub-blocks in each block, e.g. 32, and a processing of a 
32 nd root for each of the sub-block signals. If 32 data are 
multiphed first, an overflow is necessarily produced, so that 
it becomes necessary to carry out a processing to find a 32 45 
nd root of each sub-block signal prior to multiplication. In 
such case, 32 times of processing to find 32 nd roots are 
required to increase the processing volume. 

Thus, in the present fourth embodiment, the standard 



where BN=32. 
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2^ (Tms{t) - r rn j] 



(16) 



wherein i is an index for the sub-block, such as i=0 to 31, 
is the number of samples in each sub-block or sub-block 
length, such as B^=8, and Bj^ is the number of sub -blocks in 
each block, such as B^32. The number of samples N in 
each block is set to e.g. 256. 

Since the standard deviation according to formula 
(16) is increased with increase in the signal level, it is 
normalized by division with the mean value ims of the 
formula (15), If the normalized standard deviation is 
expressed as a^, 

a„»Cr/»*/nns (17) 

where becomes larger and smaller for a voiced 
speech segment and an unvoiced speech segment or the 
background noise, respectively. Since the speech signal may 
be deemed to be voiced if is larger than a predetermined 
threshold value a,;,, while it may be highly likely to be 
unvoiced or background noise if is smaller than the 
threshold value a^, the remaining conditions, such as the 
signal level or the tilt of the spectrum, are analyzed. The 
concrete value of the threshold value may be set to 0.4 
K.-0.4). 

The reason the above-described analysis of the energy 
distribution on the time scale has been undertaken is that a 
difference in the manner of distribution of the short-time 
effective values (rms values) between the vowel part of the 
speech shown in FIG. la and the consonant part thereof 
shown in FIG. lb is noticed from one sub-block to another. 
That is, the distribution of the short-time effective values 
(rms values) in the vowel part as shown by a curve b in FIG. 
la exhibits a larger bias, while that in the consonant part as 
shown by a curve b in FIG. lb is substantially planar. 
Meanwhile, curves a in FIG. la and lb represent signal 
waveforms or sample values. For analyzing the distribution 
of the short-time rms values, the ratio of the standard 
deviation in each block of the short-time rms values to the 
mean value rms thereof, that is the above-mentioned nor- 
malized standard deviation 0„, is employed in the present 
embodiment. 

An arrangement for the above-mentioned analysis of the 
energy distribution on the time-scale is shown in FIG.-6. 
Input data from input terminal 51 are supplied to an effective 
value calculating unit 61 to find an effective value rms(i) 
from one sub-block to another. This effective value rms(i) is 



deviation and a mean value ms of the effective values ^^Pfi^f ^ ^^^^ standard deviation calculating 

(rms values) of the 32 sub-blocks of each block are found 
and the distribution of the effective values (rms values) is 
detected depending on these values, for example, on the 
ratio of these values. That is, the effective rms value of each 
sub-block, the standard deviation and the mean value 
rms thereof in one block of the 32 sub-blocks, are expressed 
by the formulas (14), (15) and (16): 



55 



where i is over or equal than 0, and less than Bj^.(-32). 



*-0 



unit 62 to find the mean value cms and the standard deviation 
a^. These values are then supplied to a normahzed stan- 
dard deviation value calculating unit 63 to find the normal- 
ized standard deviation which is supplied to a noise or 
unvoiced segment discriminating unit 64. 

The manner of checking of the spectral gradient or tilt is 
hereinafter explained. 

Usually, signal energies are concentrated in the low 
frequency range and in the high frequency range on the 
60 frequency scale with the voiced speech segment and with the 
unvoiced speech segment or background noise, respectively. 
Consequently, the ratio of the high and low range energies 
is taken and used as a measure for evaluation of whether or 
not the segment is a noise segment. That is, an input sample 
0^) 65 train x(n) in one block, supplied from input terminal 51 of 
FIG. 7, where 0^n<N and Na256), is windowed by a 
window analysis unit 52, e.g. with a Hamming window, and 
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processed with FFT by fast Fourier transform unit 53. The 
result of the above -described processing arc indicated by 

Re(j) (0^j<N/2) 

ImO) (0^j<N/2) 
where ReQ) and Im(j) arc real number part and imaginary 
number part of the FFT coefiGcients, respectively. N/2 is 
equivalent to Jt of the normalized frequency and corresponds 
to the real frequency of 4 kHz because x(n) is data resulting 
from sampling at a sampling frequency of 8 kHz. 

The results of the FETT processing are supplied to a 
spectral intensity calculating unit 54 where the spectral 
intensity of each point on the frequency scale a^(j) is found. 

The spectral intensity calculating unit 54 executes a 
processing similar to that executed by the energy detection 
unit 34 of the second embodiment, that is, it executes a 
processing according to formula (9). The spectrum intensi- 
ties a^(j), that is the processing results, are supplied to 
energy distribution calculating unit 55. The unit 55 executes 
processing by energy detection units 34a, 346 of the low- 
range and high-range sides within the energy detection unit 
34, that is processing of the low-range energies according 
to formula (10) and high-range energies according to 
formula (11), as shown in FIG. 4. The unit 55 also finds a 
ratio-parameter fi,^SJSjf, indicating an energy balance, 



detected accurately with a smaller processing volume. By 
compulsorily setting to UV a block decided to be back- 
ground noise, it becomes possible to suppress extraneous 
sound, such as beat caused by noise encoding/decoding. 
5 A concrete example of a multi-band excitation (MBE) 
vocoder, as a typical example of a speech signal synthesis/ 
analysis apparatus (vocoder) to which the method of the 
present invention may be applied, is hereinafter explained. 
The MBE vocoder is disclosed in, for example, D. W. GriEQn 
and J. S. Lim, Multi-band Excitation Vocoder, "IEEE Trans- 
actions Acoustics, Speech and Signal Processing, vol. 36, 
pp. 1223 to 1235, August 1988". With the conventional 
partial auto-correlation (PARCOR) vocoder, speech signals 
are modelled by switching between voiced and unvoiced 
j5 segments on the block-by-block or frame-by-frame basis, 
whereas, with the MBE vocoder, speech signals are modeled 
on an assumption that a voiced segment and an unvoiced 
segment exist in a concurrent frequency domain, that is in 
the frequency domain of the same blocdc or frame. 
20 FIG. 8 shows, in a schematic block diagram, the above- 
mentioned MBE vocoder in its entirety. 

In this figure, input speech signals, supplied to an input 
terminal 101, are supplied to a high-pass filter (HPF) 102 
where a dc oflEsct and at least low-range components of 200 



according to formula (12). If the ratio is low, energy distri- 25 Hz or less for bandwidth limitation to e.g. 200 to 3,400 Hz, 
bution is towards the high range side, so that the signal is 
likely to be a noise or a consonant sound. The parameter f^ 
is supplied to an unvoiced segment discriminating unit 64 or 
discriminating the noise or unvoiced segment. 

The mean signal level 1^, indicated by formula (13), is 
calculated by a mean level calculating unit 56, which is 
equivalent to the signal level calculating unit 36 of the 
preceding second embodiment. The mean signal level is 
also supplied to the unvoiced speech segment discriminating 
unit 64. 

The unvoiced segment discriminating unit 64 for dis- 
criminates the voiced segment from the unvoiced speech 
segment or noise based on the calculated values o^, f^, and 
Ig. If the processing for such discrimination is defined as 
F(*), the following may be recited as specific examples of 
the function F(ct„, f^, 1^) 

By way of a first example, if the conditions 
4<ffcfA and a^<a„rt and 1^<1^,;, 
where f^;,, a^^^ and \^^^ arc threshold values, be satisfied, the 



25 _ 

are efiminatcd. Output signals from filter 102 arc supplied to 
a pitch extraction unit 103 and a window analysis unit 104. 
In the pitch extraction unit 103, the input speech signals are 
segmented by a rectangular window, that is, divided into 
30 blocks, each consisting of a predetermined number N of 
samples, N being e.g. 256, and pitch extraction is made for 
speech signals included in each block. The segmented block, 
consisting of 256 samples, are time shifted at a frame 
interval of L samples, L being e.g. 160, so that an overlap 
35 between adjacent blocks is N-L samples, e.g. 96 samples. 
The window analysis unit 104 multiplies the N-samplc block 
with a predetermined window function, such as a hamming 
window, so that a windowed block is time shifted at an 
interval of L samples per frame. 

Such windowing operation may be mathematically rep- 
resented by 



40 



rJik,q).xiq)w{kL-q) (18) 

wherein k indicates a block number and q the time index of 



speech signal is decided to be a noise and the band in its 45 data or sample number. Thus the above formula indicates 
entirety is set to be unvoiced (UV). As specific examples for. _ that the q*th data. x(q) of pre-processing input data. is. 

the threshold values, f^,,;,, a^,^ and 1^^ may be equal to 15, multiphed by a window function of the k^h block w(kL-q) 

0.4 and 550, respectively. to give data x,^, q). The window function w^r) within llie 

By way of a second example, the normahzed standard pitch extraction unit 103 for a rectangular window shown in 

deviation a^. may be observed for a slightly longer time 50 FIG. 9a is 
period for improving its reliability. Specifically, energy 



distribution on the time domain is deemed to be flat if 
^m'^^mtfi for an M number of consecutive blocks and a 
state flag a,,„„ is set (a,,^„-l). If c^^^.^ for any one or 
more of the locks, the state flag a^tate is reset (o^,^**"^). 
As for the function F(*), the signal is decided to be noise or 
unvoiced if 



0 g r<N 



(19) 



whereas the window function W;,(r) in the window analysis 
unit 104 for the hamming window is 



L<£i^f, and cr„„^« 



1 and L<L, 



0.54-0.46 



with the VAJV flags being all set to UV. 

If the normalized standard deviation a„ is improved in 60 
reliability, as in the second example, checking for the signal 
mean level 1„ may be dispensed with. As for the function 
F(*) in such case, the speech signal may be decided to be 
unvoiced or noise if 

ffc<fftrA and ar,„«=l. 65 
With the above-described fourth embodiment, the back- 
ground noise segment or the unvoiced segment can be 



cos(2jir/(JV - 1) 



(20) 



When employing the window functions w^r) or W;,(r), the 
non-zero segment of the window function w(r) (-w(kL-q)) 
is 

0^kL-q<N 
Modifying this, 
kL-N<qikL 
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Therefore, it is when kL-N<q^kL that the window function 
w^kL-q) is equal to 1 for the rectangular window, as shown 
in FIG. 10. Besides, the formulas (18) to (20) indicate that 
a window of a length N (=256) proceeds at a rate of L (=160) 
samples. The non-zero sample trains at each point N 
(0^r<N), segmented by the window functions of the for- 
mulas (19), (20) are indicated as x^^k, r) and x^^^(k, r), 
respectively. 

In the window analysis unit 104, 0-data for 1792 samples 
arc appended to the 256-sample-pcr-block sample train 
x^ft(k, r), multiplied by the Hamming window according to 
formula (20), to provide 2048 time -domain data string 
which is orthogonal transformed, e.g. fast Fourier 
transformed, by an orthogonal transform unit 105, as shown 
in FIG. U. 

In the pitch extraction unit 103, pitch extraction is per- 
formed on the N-sample-per-block sample train x^X^ 0- 
Pitch extraction may be achieved by taking advantage of 
periodicity of the time wavefonm or the frequency of the 
spectrum or an auto-correlation function. In the present 
embodiment, pitch extraction is achieved by a center clip 
waveform auto -correlation method. Although a clip level 
may be set for each block as the center clip level in each 
block, signal peak levels of the sub-blocks, divided from 
each block, are detected, and the clip levels are changed 
stepwise or continuously within the block in case of a larger 
difference in the peak levels of these sub-blocks. The pitch 
period is determined based on the peak position of the 
auto-correlation data of the center clip waveform. To this 
end, plural peak values are previously found from the 
auto-correlation data belonging to the current frame, 
wherein auto-correlation is found for the N-sample-per- 
block data. If the maximum one of the plural peaks exceeds 
a predetermined threshold, the maximum peak position is 
the pitch period. If otherwise, a peak is found which is 
within a pitch range satisfying a predetermined relation with 
respect to a pitch as fo\ind with frames other than the current 
frame, such as temporally preceding and succeeding frames, 
such as within a pitch range of ±20% with the pitch of the 
temporally preceding frame as center, and the pitch of the 
current frame is determined based on the thus found peak 
position. The pitch extraction unit 103 executes a rough 
pitch search by an open loop operation. Pitch data extracted 
by the unit 103 is supphed to a fine pitch search unit 106 
where a fine pitch search by a closed loop operation is 
executed: ---- - _ __ 

The rough pitch data from pitch extraction unit 103, 
expressed io integers, and frequency-domain data from 
orthogonal transform unit 105, such as fast Fourier trans- 
formed data, are supplied to fine pitch search unit 106. The 
fine pitch search unit 106 swings the data at an interval of 0.2 
to 0.5 by ± several samples, about the rough pitch data value 
as the center, for arriving at an optimum fine pitch data as a 
floating-point number. As the fine search technique, a 
so-called analysis by synthesis method is employed, and the 
pitch is selected so that the synthesized power spectrum is 
closest to the power spectrum of the original sound. 

The fine pitch search is explained. First, with the above- 
mentioned MBE vocoder, the spectral data on the frequency 
domain S(j), obtained by orthogonal transform, such as FFT, 
is supposed to be modelled by the formula 

S(j)'H(f)]^(f)\0<f<r (21) 

where J corresponds to co^4ji=f/Z and to 4 kHz if the 
sampling frequency f^^'OiJTjz is 8 kHz. If, in the above 
formula (21), the spectral data S(j) on the frequency scale 
has a waveform as shown in FIG. 14fl, H0 represents an 
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envelope of the original spectral data S(j), as shown in FIG. 
14fr, while E(j) represents the spectrum of periodic cqui- 
level excitation signals as shown in FIG. 14c. In other 
words, the FFT spectrum S(j) is modelled as a product of the 
5 spectral envelope HQ) and the power spectrum of the 
excitation signals |E(j)|. 

The power spectrum |E(j)| of the excitation signals is 
formed by repetitively arraying the spectral waveform, cor- 
responding to the waveform of a frequency band, from band 
to band on the frequency scale, taking into account the 
periodicity of the waveform on the frequency scale as 
determined depending on the pitch. Such 1-band waveform 
may be formed by fast Fourier transforming the waveform 
shown in FIG. 11, which is the 256-sample hamming 
window function and 0 data for 1792 samples, appended 
thereto, and which herein is deemed to be time-domain 
signals, and by segmenting the resulting impulse waveform 
having a bandwidth on the frequency domain in accordance 
with the above pitch. 

Then, for each of the bands, divided in accordance with 
20 the pitch, an amplitude |A^|, which represents HQ) and 
minimizes the error from band to band, is found. If an upper 
hrait and a lower limit of e.g. the m'th band, that is the band 
of the m'th harmonic, are denoted as a,„, b^^ respectively, an 
error e„ of the m'th band is given by 

25 

(22) 

€^-.2 {m-Mm\v 

Such value of |A,„| as will minimize the error is found 

30 

ae„ bm (23) 
2.Sjra|A„|l£OOl}ra-0 

The error is minimized when the value of |A,„| is such 
as defined by the formula (23). Such amplitude |A^| is found 

40 from band to band and the error e,„ for each band, as defined 
by the formula (22), is found using each amplitude |A,„| 
having the above value. The sum of the errors e„, for aU of 
the bands is then found. The sum 2€^ is found for several 
minutely different pitch values to find a pitch value which 

45 will minimize the error sum 2e^. 
- Specifically, several pitch values above and below each of - 
an integer-valued rough pitch as found by the pith extraction 
unit 103 are provided at a graduation of e.g. 0.25. The error 
sum He^ is found for each of the plural pitch values. It is 

50 noted that, if the pitch is fixed, the band width is also fixed, 
so that the error of formula (22) may be found using the 
power spectrum |S0| and the excitation signal spectrum 
|E(j)| on the frequency scale, in accordance with formula 
(23), and hence the sum Se^ for the totality of the bands may 

55 be found. The sum Ze^ is found for each of the plural pitch 
values to find an optimum pitch value associated with the 
minimum sum value. In this manner, an optimum fine pitch 
having a graduation of 0.25 and the amplitude |A^| associ- 
ated with the optimum pitch may be found at the fine pitch 

60 search unit 106. 

In the above explanation of the fine pitch search, the 
totality of the bands is assumed to be voiced, for simplifying 
the explanation. However, since the model employed in the 
M.BE vocoder is such that unvoiced segments are present on 

65 the concurrent frequency scale, it becomes necessary to 
make voicedAm voiced decision for each of the frequency 
bands. 
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The optimum pitch data and the amplitude data |Aj from 
the fine pitch search unit 106 arc transmitted to a voiced/ 
unvoiced discriminating unit 107 where the voiced/unvoiced 
decision is performed from one band to another. For such 
discrimination, a noise to signal ratio (NSR) is used. That is 
the NSR of the m'th band is expressed by 



(24) 



10 



If the NSR value is larger than a predetermined threshold, 
such as 0,3, that is if an error is larger, for a given band, it 
may be assumed that approximation of |S(j)| by |A^||E(j)| for 
the band is not good, that is that the excitation signal |E(j)| 
is inappropriate as the fundamental, signal, so that the band 
is decided to be unvoiced (UV). If otherv^Hse, it may be 
assumed that approximation is good to a certain extent, so 
that the band is decided to be voiced (V). 

An ampUtude re-evaluation unit 108 is supplied with 
frequency-domain data from orthogonal transform unit 105, 
amplitude data |A^| from fine pitch search unit 106, evalu- 
ated as corresponding to fine pitch, and voicedAinvoiced 
(V/UV) discrimination data from V/UV discrimination unit 
107. The amplitude re -evaluation unit 108 again finds the 
amplitude of the band decided to be unvoiced (UV) by the 
VAJV discriminating unit 107. The amplitude |Aj(^of the 
UV band may be found by the formula 



6« 
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The data from the amphtude reevaluation unit 108 are 
transmitted to a data number conversion unit 109, which 35 
performs an operation similar to a sampling rate conversion. 
The data number conversion unit 109 assures a constant 
number of data, especially the number of amplitude data, in 
consideration of the variable number of frequency bands on 
the frequency scale, above all, the number of amplitude data. 40 
That is, if the effective range is up to 3400 Hz, the effective 
range is divided into 8 to 63 bands, depending on the pitch, 
so that the number vnj^l of amphtude data |A,„|, inclusive 
of the amplitude |A,„|tn^of the UV bands, obtained from one 
band to another, is also changed in a grange of from 8 to 63. 45 
To this end, the data number conversion unit 109 converts 
the number of the variable amplitude data m^ +1 into a 
constant number N^, such as 44. 

In the present embodiment, dummy data are appended to 
amplitude data for an effective one block on the frequency so 
scale which will interpolate from the last data up to the first 
data in the block to increase the number of data to Njr. A 
number of amplitude data which is K^^ times N^r, such as 8 
times are found by bandwidth limiting type oversam- 
phng. The ((mj^^l)xK^^) number of amplitude data are 55 
Unearly interpolated to increase the number of data to a 
larger value N^ such as 2048, which N^^^ number of data are 
sub-sampled to give the above-mentioned predetermined 
number N^- of, e.g. 44, samples. 

The data from the data number conversion unit 109, that 60 
is the constant number N^of amplitude data, are supphed to. 
a vector quantization unit 110, where they are grouped into 
sets each consisting of a predetermined number of data for 
vector quantization. Quantized output data from vector 
quantization unit 110 are outputted at output terminal lU. 65 
Fine pitch data from fine pitch search unit 106 are encoded 
by a pitch encoding unit 115 so as to be outputted at output 
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terminal 112. The V/UV discrimination data from unit 107 
are outputted at output terminal 113. These data from output 
terminals 11 to 113 are transmitted as predetermined format 
transmission signals. 

Meanwhile, these data are produced by processing data in 
each block consisting of N samples, herein 256 samples. 
Since the block is time shifted with the L-sample frame as 
a unit, transmitted data are produced on the frame -by-frame 
basis. That is, the pitch data, V/UV discrimination data and 
amplitude data are updated at the frame period. 

Referring to FIG. 13., an arrangement of the synthesis or 
decoder side for synthesizing the speech signals based on the 
transmitted data is explained. 

Referring to FIG. 13, the vector quantized ampHtude data, 
the encoded pitch data and the V/UV discrimination data are 
supphed to input terminals 121, 122 and 123, respectively. 
The vector quantized amplitude data are supplied to an 
inverse vector quantization unit 124 for inverse quantization 
and thence to data number inverse conversion unit 125 for 
inverse conversion. The resulting amplitude data are sup- 
plied to a voiced sound synthesis unit 126 and to an 
unvoiced sound synthesis unit 127. The encoded pitch data 
from input terminal 122 are decoded by a pitch decoding 
unit 128 and thence supplied to a data number inverse 
conversion unit 125, a voiced sound synthesis unit 126 and 
to an unvoiced sound synthesis unit 127. The V/UV dis- 
crimination data from input terminal 123 are supplied to 
voiced sound synthesis unit 126 and unvoiced sound syn- 
thesis unit 127. 

The voiced sound synthesis unit 126 synthesizes a voiced 
sound waveform on the time scale by e.g. cosine waveform 
synthesis. The unvoiced sound synthesis unit 127 synthe- 
sizes unvoiced sound on the time domain by filtering a white 
noise by a band-pas filter. The synthesized voiced and 
unvoiced waveforms are summed or synthesized at an 
additive node 129 so as to be outputted at output terminal 
130. The amplitude data, pitch data and V/UV discrimina- 
tion data are updated during analysis at an interval of a frame 
consisting of L samples, such as 160 samples. However, for 
improving continuity or smoothness between adjacent 
frames, those ampUtude or pitch data at e.g. the center of 
each frame are used as the above-mentioned amphtude or 
pitch data, and data values up to the next adjacent frame, that 
is the assynthesized frame, are found by interpolation. That 
is, in the synthesized frame, for example, an interval from 
the center of an analytic frame to the center of -the next 
analytic frame, data values at a leading end sampling point 
and at a terminal end sampling point, that is at a leading end 
of the next synthetic frame, are given, and data values 
between these sampling points are found by interpolation. 

The synthesizing operation by the voiced sound synthesis 
unit 126 is explained in detail. 

If the voiced sound of the above-mentioned synthetic 
time-domain frame, consisting of L samples, for example, 
160 samples, for the m'th band, that is the m'th harmonics, 
decided to be voiced (V), is denoted as V^(n), it my be 
expressed by 



(26) 



using the time index or sample number in the synthetic 
frame. The voiced sounds of the bands decided to be voiced 
(V), among the totality of the bands, are summed together 
(^^mW) to synthesize the ultimate voiced sound V(n). 

In the formula (26), A^(n) is an amplitude of the m'th 
harmonics as interpolated between the leading end and the 
terminal end of the synthetic frame. Most simply, it suffices 
to linearly interpolate the values of the m'th harmonics 
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updated from frame to frame. That is, if the amplitude value FIG. 14a shows an example of the spectrum of the speech 

of the m'th harmonics at the leading end (n=0) of the signals wherein the bands having the band numbers or 

assynthesized frame is denoted as A^^ and the amplitude harmonics numbers of 8, 9 and 10 are decided to be 

value of the m'ih harmonics at the trailing end (noL) of the unvoiced, with the remaining bands being decided to be 

synthetic frame, that is at the leading end of the next 5 voiced. The time-domain signals of the voiced and unvoiced 

synthetic frame, is denoted as A^„, it suffices to calculate ^^nds are synthesized by the voiced sound synthesis unit 

A^(n) by the formula "* unvoiced sound synthesis unit 127, respectively. 

The operation of synthesizing the unvoiced sound by the 

A„{n)''(L-n)AcJL+nAj:,jL (27) unvoiccd sound synthcsis unit 127 is explained. 

The time-domain white noise signal wavefonn from white 

The phase e^(n) in the above formula (26) may be found noise generator 131 is windowed by a suitable window 

by the formula function, such as a hamming window, to a predetermined 

a / \ ^ 2 A,. /no\ number, such as 256 samples,and short-time Fourier trans- 

formed by an STFT umt 132 to produce a power spectrum 
where denotes the phase of the m'th harmonics at the of the white noise on the frequency scale, as shown in FIG. 
leading end (n-O) of the synthetic frame (initial phase of the 12b. The power spectrum from unit 132 is supplied to a band 
frame), cooj denotes a fundamental angular frequency at the amplitude processing unit 133 where the spectrum for the 
leading end of the synthetic frame (n-O) and co^i denotes a bands for m»8, 9, 10 decided to be unvoiced is multiplied by 
fundamental angular frequency at the traiKng end (n«L) of the amplitude |A^|t/v^ while the spectrum of the remaining 
the synthetic frame or at the leading end of the next synthetic ^ bands are set to 0, as shown in FIG. 12c. The power 
frame. Aco in the above formula (28) is selected to be amplitude processing unit 133 is supplied with the above- 
minimum so that the phase i^j^^ at n-L became equal to mentioned amplitude data, pitch data and VAJV discrimi- 
e„(L). nation data. An output of the band amplitude processing unit 

The manner of finding the amphtude A^(n) and the phase 133 is suppHed to an ISTFT unit 134 where it is inverse 

e,„(n) for an arbitrary m'th band, depending on the results of ^ short-time Fourier transformed using the phase of the origi- 

V/UV discrimination for n-O and n-L, is hereinafter nal white noise for transforming the frequency-domain 

explained. signal into the time -domain signal. An output of the ISTFT 

If the m'th band is decided to be voiced both for n-O and processing unit 134 is supplied to an weighted overlap -add 

n=L, the amplitude A,„(n) may be found by linear interpo- unit 135 where it is processed with a repeated weighted 

lation of the transmitted values of the ampUtudes Aq^, A^ overlap-add processing on the time scale to enable the 

in accordance with formula (27). Aco is set so that the phase original continuous noise wavefonn to be restored. In this 

6„(n) ranges from 6„(0) equal to for n«0 to BJJL) equal manner, a continuous time-domain waveform is synthesized, 

to 4(Lm for n«L. An output signal from the overlap-add unit 135 is supplied 

If the m'th band is decided to be voiced and imvoiced for to the additive node 129. 

n«0 and n=L, respectively, the amplitude A„(n) is linearly In this manner, signals of the voiced and unvoiced 

interpolated so that the transmitted amplitude value ranges segments, synthesized by the synthesis units 126, 127 and 

from Ao„ for A„(0) to 0 for A„(L). The transmitted ampli- re-transformed to the time -domain signals are mixed at the 

tude value A^„ for n^L is an amplitude value of the additive node 129 at a suitable fixed mixing ratio. The 

unvoiced sound employed at the time of synthesis of the reproduced speech signals are outputted at output terminal 

unvoiced sound as later explained. The phase 6^(n) is set so ^ 130. 

that 0„,(O)=(()o;„ and Aco«0. TTie voiced/unvoiced discriminating method according to 
If the m'th band is decided to be unvoiced and voiced for the present invention may also be employed as means for 
n=0 and for n»L, respectively, the amplitude A^(n) is detecting the background noise for decreasing the environ- 
linearly interpolated so that so that the amplitude A^(0) for mental noise (background noise) at the transmitting side of 
n«0 is 0 and the amplitude value becomes equal to the e.g. a car telephone. That is, the present method may also be 
transmitted valtie A^^ for n=L. The phase e„(n) is set so that ~ employed.for noise detection for so-called speech erihance- 
the phase 9^(0) for n=0 is given by ment of processing the low-quality speech signals mixed 

with noise for eliminating adverse effects by the noise to 



e.(OH»i>»-m(<iioi+(.>ii)L/2 (29) 



provide a sound closer to a pure sound. 



using the phase value (j)^^ at the terminal end of a frame, and so Y^^^^ ^ clamied is: 
Ad) is set so that Aa)=0. ^' ^ method for discriminating a digital speech sound 

The technique of setting Aco so that 6 (L) is equal to (jjj, comprising dividing digital speech signals into signal blocks 
when the m'th band is decided to be voiced both for n=0 and ^^^^ including a predetermined number of samples, and 
n=Lis explained. By setting 0=1 in formula (24), making a decision for each of said signal blocks as to 

whether the speech sound is voiced, said method further 
e„iL) - ma>oiZ. + iMa)xi-a)oi)/2i + 4»^- + A(uL comprising the stcps of: 

transforming signals of each of said signal blocks into 

data on the frequency scale, 
finding low frequency range energies based on said data 
go on the frequency scale, 
Arranging, Aco becomes finding high frequency range energies based on said data 

Ao.HmodM».--4U-«LK;«o.O/2yi (30) on the frequency scale ^ ^ . , ^, ^ 

findmg a mean signal level of each of said signal blocks 
In the above formula (30), mod 2n(x) is function which from low frequency range energies and high frequency 

maps the main value of x by a value between -rc and +n. For 65 range energies, 

example; if x=1.3rt, 2.3jr and -1.3jt, mod 2ji(x) is equal to dividing signals of each of said signal blocks into plural 

-0.7n, 0.33C and 0.7n, respectively. sub-blocks, 
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aDalyzing said sub-blocks to find statistical characteristics 
of each of said sub -blocks, 

calculating a bias of said statistical characteristics of said 
signals in the time domain, and, 

deciding whether or not said signal blocks are voiced by 
comparing said mean signal level with a first predeter- 
mined threshold and by further comparing said bias of 
said statistical characteristics in the time domain with a 
second predetermined threshold. 

2. The method as claimed in claim 1 wherein a decision 
as to whether or not said signal blocks are voiced is made 
further based on a ratio between said low frequency range 
energies and said high frequency range energies. 

3. The method as claimed in claim 1 wherein a ratio 
between low frequency range energies and high frequency 
range energies are found based on said low frequency range 
energies and said high frequency range energies and wherein 
a decision as to whether or not said signal blocks are voiced 
is made by further comparing said ratio with a predeter- 
mined threshold. 

4. The method as claimed in claim 1 wherein said low 
frequency range energies and said high frequency range 
energies are demarcated from each other at a demarcation 
frequency which is between 0 kHz and 3.4 kHz. 

5. The method as claimed in claim 1 further comprising 
the step of: 

finding between said low frequency range energies and 
said high frequency range energies, said ratio being 
used as basis in deciding whether or not said signal 
blocks are voiced. 

6. The method as claimed in claim 1 further comprising 
the steps of: 

finding a ratio between said low frequency range energies 
and said high frequency range energies, and, 

deciding whether or not each of said signal blocks are 
voiced by further comparing said ratio with a prede- 
termined threshold, 

7. A method for discriminating a digital speech sound 
comprising dividing digital speech signals into signal blocks 
each including a predetermined number of samples, and 
making a decision as to whether or not the speech sound is 
voiced for each of said signal blocks, said method further 
comprising the steps of: 

finding anjeffective value^f signals in each of a plurality 

of sub-blocks divided from each of said signal blocks, 
finding a standard deviation and a mean value of said 

signals of each signal block based on the effective value 

as found for each of said sub-blocks, 
finding a normalized standard deviation in the time 

domain based on said standard deviation and said mean 

value, 

frequency-analyzing signals of each of said signal blocks 
to find spectral intensities at a plurality of frequencies, 

finding an energy distribution based on said spectral 
intensity at each of said plurality of frequencies, 

finding a mean signal level of signals of each of said 
signal blocks from said energy distribution, and, 

making a decision as to whether or not said signal blocks 
are voiced by comparing said normalized standard 
deviation, said energy distribution and said mean signal 
level with each corresponding predetermined thresh- 
old. 

8. The method as claimed in claim 7 wherein said spectral 
intensities at each point of the frequency domain are divided 
into groups of low-range frequency and high-range fre- 
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quency and wherein said energy distribution is found based 
on a ratio between energies of the respective groups. 

9. The method as claimed in claim 8 wherein said low 
frequency range energies and said high frequency range 

5 energies are demarcated from each other at a demarcation 
frequency which is between 0 kHz and 3.4 kHz. 

10. An apparatus for discriminating a digital speech sound 
by dividing digital speech signals into signal blocks each 
including a predetermined number of samples, and making 

10 a decision for each of said signal blocks, as to whether or not 
the speech sound is voiced, said apparatus comprising: 
frequency data calculating means for transforming signals 
of each of said signal blocks into frequency-domain 
data, 

means for finding low frequency range energies based on 

said frequency-domain data, 
means for finding high frequency range energies based on 

said frequency-domain data, 
20 means for finding a mean signal level of each of said 

signal blocks from said low frequency range energies 

and said high range energies, 
means for dividing signals of said signal block into plural 

sub-blocks, 

25 means for analyzing said sub-blocks for finding statistical 
characteristics of each of said sub-blocks, 

means for calculating a bias of said statistical character- 
istics of said signals in the time domain, and, 

decision means for making a decision as to whether or not 
said signal blocks are voiced by comparing said mean 
signal level with a first predetermined threshold and by 
further comparing said bias of said statistical charac- 
teristics in the time domain with a second predeter- 
mined threshold. 

11. The apparatus as claimed in claim 10 wherein said 
decision means decides whether or not said signal blocks are 
voiced further based on a ratio between said low frequency 
range energies and said high frequency range energies. 

^ 12. The apparatus as claimed in claim 10 further com- 
prising: 

means for finding a ratio between said low frequency 
range energies and said high frequency range energies 
based on said low frequency range energies and said 
45 high frequency range energies wherein said decision 
- - -means decides whether or not said sigiial blocks are 
voiced by further comparing said ratio with a prede- 
termined threshold. 

13. The apparatus as claimed in claim 10 wherein said low 
5Q frequency range energies and said high frequency range 

energies are demarcated from each other at a demarcation 
frequency which is between 0 kHz and 3.4 kHz. 

14. The apparatus as claimed in claim 10 further com- 
prising: 

55 means for finding a ratio between said low frequency 
range energies and said high frequency range energies, 
said ratio being used as basis in deciding whether or not 
said signal blocks are voiced. 

15. The apparatus as claimed in claim 10 further com- 
60 prising: 

means for finding a ratio between said low frequency 
range energies and said high frequency range energies, 
wherein said decision means decides whether or not 
said signal blocks are voiced by further comparing said 
65 ratio with a predetermined threshold. 

16. An apparatus for discriminating a digital speech sound 
by dividing digital speech signals into signal blocks each 
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including a predetermined number of samples, and making 
a decision for each of said signal blocks as to whether or not 
the speech sound is voiced, said apparatus comprisiog: 
means for finding an effective value of signals in each of 

a plurality of sub-blocks divided from each of said ^ 

signal blocks, 

means for finding a standard deviation and a mean value 
of said signals of each signal block based on an 
effective value as found for each of said sub -blocks, 

means for finding a normalized standard deviation in the 
time domain based on said standard deviation and said 
mean value, 

means for frequency-analyzing signals of each of said 
signal blocks to find spectral intensities at a plurality of 
frequencies, 



455 

24 

means for finding enetgy distribution based on said spec- 
tral intensity at each of said plurality of frequencies, 

means for finding a mean signal level of signals of each 
of said signal blocks from said energy distribution, and, 

decision means for deciding whether or not said signal 
blocks are voiced by comparing said normalized stan- 
dard deviation, said energy distribution and said mean 
signal level with each corresponding predetermined 
threshold. 

17. The apparatus as claimed in claim 16 wherein said 
spectral intensities at each point of the frequency domain are 
divided into groups of low-range frequency and high-range 
frequency and wherein said energy distribution is found 
based on a ratio between energies of the respective groups. 
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